Privacy-Preserving Sharing of Horizontally-Distributed Private Data for Constructing Accurate Classifiers

Authors

  • Vincent Yan Fu Tan
  • See-Kiong Ng
Abstract

Data mining tasks such as supervised classification can often benefit from a large training dataset. However, in many application domains, privacy concerns can hinder the construction of an accurate classifier by combining datasets from multiple sites. In this work, we propose a novel privacy-preserving distributed data sanitization algorithm that randomizes the private data at each site independently before the data is pooled to form a classifier at a centralized site. Distance-preserving perturbation approaches have been proposed by other researchers but we show that they can be susceptible to security risks. To enhance security, we require a unique non-distance-preserving approach. We use Kernel Density Estimation (KDE) Resampling, where samples are drawn independently from a distribution that is approximately equal to the original data’s distribution. KDE Resampling provides consistent density estimates with randomized samples that are asymptotically independent of the original samples. This ensures high accuracy, especially when a large number of samples is available, with low privacy loss. We evaluated our approach on five standard datasets in a distributed setting using three different classifiers. The classification errors only deteriorated by 3% (in the worst case) when we used the randomized data instead of the original private data. With a large number of samples, KDE Resampling effectively preserves privacy (due to the asymptotic independence property) and also maintains the necessary data integrity for constructing accurate classifiers (due to consistency).
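The resampling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the Gaussian kernel, the Silverman rule-of-thumb bandwidth, the per-class resampling, and the function names `kde_resample` and `sanitize_site` are all assumptions made for illustration. Drawing from a Gaussian KDE amounts to picking an original point at random and adding kernel-shaped noise to it.

```python
import numpy as np


def kde_resample(X, n_samples=None, bandwidth=None, rng=None):
    """Draw samples from a Gaussian KDE fitted to the rows of X.

    Assumption: a product Gaussian kernel with a per-dimension bandwidth
    from Silverman's rule of thumb; the paper may use a different choice.
    Requires at least two rows so the sample standard deviation is defined.
    """
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    if n_samples is None:
        n_samples = n
    if bandwidth is None:
        sigma = X.std(axis=0, ddof=1)
        bandwidth = sigma * (4.0 / ((d + 2) * n)) ** (1.0 / (d + 4))
    # Sampling from a KDE: choose a kernel center uniformly at random,
    # then add noise drawn from the kernel centered there.
    centers = X[rng.integers(0, n, size=n_samples)]
    noise = rng.normal(size=(n_samples, d)) * bandwidth
    return centers + noise


def sanitize_site(X, y, rng=None):
    """Resample each class separately so the labels stay meaningful.

    Hypothetical helper: one site calls this on its private data and
    shares only the returned randomized samples.
    """
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    X_parts, y_parts = [], []
    for label in np.unique(y):
        X_class = X[y == label]
        X_parts.append(kde_resample(X_class, rng=rng))
        y_parts.append(np.full(len(X_class), label))
    return np.vstack(X_parts), np.concatenate(y_parts)
```

In a horizontally distributed setting, each site would run `sanitize_site` on its own records, and the central site would concatenate the returned arrays before training a classifier. Because the bandwidth shrinks as the number of samples grows, the randomized data tracks the original density more closely with larger datasets.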

Similar resources

Poster: Differentially Private Decision Tree Learning from Distributed Data

The goal of privacy preserving data sharing is to share data for further analysis without revealing sensitive information. In this work, we propose a new Secure Multi-Party Computation (SMPC) protocol using Differential Privacy (DP) to protect data privacy while applying decision tree algorithm to horizontally distributed data. Pure secure multiparty computation approaches (using cryptographic ...

On A New Scheme on Privacy Preserving Data Classification

We address the privacy preserving data classification problem in a distributed system. Randomization has been proposed to preserve privacy in such circumstances. However, this approach was challenged in [12] by a privacy intrusion technique that is capable of reconstructing the private data in a relative accurate manner. In this paper, we introduce an algebraic technique based scheme. Compared ...

Privacy Preserving Naïve Bayes Classifier for Horizontally Distribution Scenario Using Un-trusted Third Party

The aim of the classification task is to discover some kind of relationship between the input attributes and the output class, so that the discovered knowledge can be used to predict the class of a new unknown tuple. The problem of secure distributed classification is an important one. In many situations, data is split between multiple organizations. These organizations may want to utilize all ...

Privacy Preserving Data Mining For Horizontally Distributed Medical Data Analysis

To build reliable prediction models and identify useful patterns, assembling data sets from databases maintained by different sources such as hospitals becomes increasingly common; however, it might divulge sensitive information about individuals and thus leads to increased concerns about privacy, which in turn prevents different parties from sharing information. Privacy Preserving Distributed ...

Performance Analysis of Privacy Preserving Naïve Bayes Classifiers for Distributed Databases

The problem of secure and fast distributed classification is an important one. The main focus of the paper is on privacy preserving distributed classification rule mining. This research paper addresses the performance analysis of privacy preserving Naïve Bayes classifiers for horizontal and vertical partitioned databases. The Naïve Bayes classifier is a simple but efficient baseline classifier....

Publication date: 2007